Comparative analysis of metagenomic classifiers for long-read sequencing datasets

Background Long reads have gained popularity in the analysis of metagenomics data. Therefore, we comprehensively assessed metagenomics classification tools on the species taxonomic level. We analysed kmer-based tools, mapping-based tools and two general-purpose long reads mappers. We evaluated more than 20 pipelines which use either nucleotide or protein databases and selected 13 for an extensive benchmark. We prepared seven synthetic datasets to test various scenarios, including the presence of a host, unknown species and related species. Moreover, we used available sequencing data from three well-defined mock communities, including a dataset with abundance varying from 0.0001 to 20% and six real gut microbiomes. Results General-purpose mappers Minimap2 and Ram achieved similar or better accuracy on most testing metrics than best-performing classification tools. They were up to ten times slower than the fastest kmer-based tools requiring up to four times less RAM. All tested tools were prone to report organisms not present in datasets, except CLARK-S, and they underperformed in the case of the high presence of the host’s genetic material. Tools which use a protein database performed worse than those based on a nucleotide database. Longer read lengths made classification easier, but due to the difference in read length distributions among species, the usage of only the longest reads reduced the accuracy. The comparison of real gut microbiome datasets shows a similar abundance profiles for the same type of tools but discordance in the number of reported organisms and abundances between types. Most assessments showed the influence of database completeness on the reports. Conclusion The findings indicate that kmer-based tools are well-suited for rapid analysis of long reads data. However, when heightened accuracy is essential, mappers demonstrate slightly superior performance, albeit at a considerably slower pace. Nevertheless, a combination of diverse categories of tools and databases will likely be necessary to analyse complex samples. Discrepancies observed among tools when applied to real gut datasets, as well as a reduced performance in cases where unknown species or a significant proportion of the host genome is present in the sample, highlight the need for continuous improvement of existing tools. Additionally, regular updates and curation of databases are important to ensure their effectiveness. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-024-05634-8.


Background
The main task in metagenomic sample analysis is determining its compositionorganisms present in the sample and their quantity.The accuracy of the final result depends on many factors, including contamination with other genetic material (i.e.host's DNA), material isolation, sequencing preparation, and used sequencing technology and classification tools.The recent improvement in both the length and accuracy of long-read sequencing technologies promises a more precise analysis.The advent of high-throughput sequencing has enabled a detailed analysis of microbial communities and their hosts through metagenomics [1,2].Together with genetic material isolation, an essential component of metagenomics sequencing workflows is a computational method for recognising organisms in a sample.Most current methods are originally designed to work with short, accurate reads from secondgeneration sequencing technologies.However, due to an increase in accuracy and throughput, long-read sequencing technologies are gaining popularity.Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) are the most popular long-read sequencing technologies.
Detection of microbes and their abundances using sequencing technologies can be divided into marker gene (typically 16S rRNA) sequencing [3] and whole-metagenome shotgun sequencing.Since the 16S rRNA gene consists of both conserved and variable regions, it is suitable for cost-effective bacteria and archaea profiling.
On the other hand, whole-metagenome shotgun sequencing covers all genomic information in a sample.It enables additional analyses such as binning into Metagenome assembled genomes (MAGs) or previously reconstructed genome sequences, antibiotic resistance gene profiling, and metabolic function profiling.Metagenomic analysis pipelines often begin by detecting and quantifying the taxa in a sample.When most of the genomes present in the sample are unknown, metagenomic de novo assembly methods (i.e.[4]) are the preferred approach.Otherwise, one can compare sequenced data, mapping it to a reference database that stores genomic information related to various taxa in FASTA format.
This work aims to analyse the performance of methods based on comparing longread sequencing data with a reference database.Several recent studies have proved the value of using long reads in the metagenomic analysis [5][6][7].Although there are several benchmarking papers on long reads [7][8][9][10], our evaluation additionally includes HiFi PacBio reads, and standard long reads mappers and incorporates an assessment of the influence of the database, read lengths and definition of abundance measures on the results in simulated and real use cases.We also assessed trade-offs between running time and memory requirement.
We evaluated the performance of thirteen different pipelines using seven synthetic datasets, three datasets obtained from mock communities and six real gut microbiome datasets.Some tools are based on nucleotide databases, while others need protein databases.
To distinguish between two MEGAN-LR versions, the one which uses a protein database and the one which uses a nucleotide database, we named them MEGAN-P and MEGAN-N, respectively.Bracken [12] is a statistical method that computes species abundance using taxonomy labels assigned by Kraken/Kraken2.We adapted Minimap2 and Ram for metagenomics classifications.Mappers usually enable full alignment mode (calculating alignment path) and mapping mode (calculating approximate alignments).We assessed Minimap2 in both modes and Ram only in mapping mode.
There are two main requirements for classification algorithms: identifying present organisms using a taxonomic rank (species in this work) and evaluating their abundances.Reaching these objectives highly depends on the community's content and the number of reads for each organism.Therefore, using existing PacBio and ONT reads from isolates, we synthesised several simple to more complex communities containing 3 to 50 species, varying from highly abundant to very sparse.
-Datasets ONT1, PB1, and PB4 reflect a community of bacteria without eukaryotic species.-Datasets ONT2 and PB2 reflect metagenomics datasets with one or more eukaryotic species and many bacterial species, respectively.-Dataset PB3 reflects a community with predominantly human reads (99%) and two low-abundance bacterial species, reflecting what one might see in an infection setting.-Datasets PB1 + NEG and PB2 represent a situation where a significant portion of the reads comes from an organism that is not present in the database and has no similar present organisms.For the PB1 + NEG dataset, we created randomized reads from the human genome.For the PB2 dataset, we added reads from D. melanogaster and human isolate datasets.
We also used three well-defined mock community datasets PB_Zymo (PacBio HiFi sequenced Zymo Gut Microbiome Standard dataset), ONT_Zymo (ONT sequenced Zymo Gut Microbiome Standard dataset) and PB_ATCC (PacBio HiFi sequenced ATCC sample).It is important to notice that while we used reads sequenced with older PacBio technologies for synthesised communities, mock communities were sequenced using Sequel 2 HiFi technology.In all tests with mock community datasets, we included PB_ Zymo, although the differences between reported and expected values were much larger than for the other two datasets.Due to this disproportion, we will focus more on other datasets in our analysis and conclusion.
Finally, we compared the tools using six real gut datasets.Three datasets are sequenced by ONT technologies (SRR15489009, SRR15489011 and SRR15489017) and three by PacBio HiFi (Sample10, Sample20 and Sample21).
The tools were tested in four areas: 1. Read level classification-how accurately can they classify each read.2. Abundance estimation-how well can they be used to estimate the abundance of organisms in the sample.3. Organism detection-how accurately can they detect organisms in a sample.4. Computational resource usage-running time and consumption of RAM.
We focus our analysis on microbial species in all areas mentioned above apart from read-level classification, where we assessed tools on species and genus ranks.Accuracy and abundance errors were calculated only for the microbial organisms, ignoring reads assigned to the human.

Read-level classification
We assessed the tools' read-level classification accuracy on seven synthesised datasets.In addition to species taxonomy rank, we analysed read accuracy on genus level to evaluate the performances of mappers when some of the species in the sample were not present in the database.Figure 1 shows that general purpose mappers and mapping-based tools outperformed others on almost all datasets and both levels.Differences between them and kmer-based tools varied up to 10% at species levels.The only exception was MEGAN-N which performed similarly to kmer-based tools.Pipelines which use a protein database underperformed significantly.As shown in Additional file 1: Table S12, protein tools had much fewer true positive classification and, on average, a high number of false positive classification resulting in lower accuracy on both species and genus level.
Minimap2 with alignment outperformed other tools, followed by mapping-based tools (deSAMBA, Metamaps) and Ram.Interesting cases were ONT1 and ONT2 datasets which contain reads of two species of the Vibrio genus that were not in databases.Since there were other similar species of the Vibrio genus in the database, some tools, such as MEGAN-N and Minimap2 in both modes, tended to assign those reads to them.In contrast, other tools, such as CLARK-S and Ram, tended to leave those reads unassigned.Therefore, the results on the ONT1 and ONT2 datasets for these four tools were almost reversed when analysing genus and species levels.CLARK-S and Ram had the highest accuracy when inspecting the ONT1 and ONT2 datasets at the species level and among the lowest when examining the dataset at the genus level.In contrast, Minimap2 and MEGAN-N had the highest accuracy at the genus level but performed worse at the species level.
Since there is an imbalance in the number of reads per species, we additionally calculated the F1 score for each class (organism in the sample) separately and averaged them (F1 macro average).Using F1 macro average instead of accuracy shows a similar pattern for most datasets with Minimap2 (both modes) surpassing others and a narrower distance between mapping-based and kmer-based tools (Additional file 1: Fig. S1).
We further investigate the influence of read length on classification.We only present analysis for Minimap2 with alignment and Kraken2, the most accurate tool at the read level and the fastest tool respectively.As evident from Fig. 2, increasing the read length led to a higher classification level.However, we could not select only the longest reads due to different read length distributions per organism.Additional analysis on how using 30% of longest reads impacted the results is provided in Additional file 1: Table S1.

Abundance estimation
Abundance estimation is arguably the most important assessment.Tools report different types of abundance.Most assessed tools report relative read counts as abundance.However, abundance might be calculated in different ways.Other measures count genomes or cells.The difference between these two is that the latter takes into account ploidy.The vendors of mock community samples usually report all of them.Here we also tested Bracken, a method that predicts genomic DNA length using read lengths.This is irrelevant for short-read sequencing because all reads are the same size.However, long-read sequencing technologies produce reads whose length might vary greatly, especially for ONT technology.
In the analysis, we compared L1 distances between reported and declared abundances using relative read count, genome length and genome count.Additional file 1: Table S5 shows the theoretical composition of Zymo and ATCC datasets.
Table 1 shows that the results are mostly consistent, considering different ways of abundance calculation (in detail analysis is presented in Additional file 1: Fig. S3).The ranking of tools by different measures slightly varies.However, it is important to note the difference between read counts and genome length measures.Although distances do not vary much for PacBio HiFi data and estimations of genomic DNA abundances are slightly more accurate when read counts are used, differences for ONT reads are noticeable and adding the length of reads increases the accuracy of the prediction.The reason for this behaviour might be in read distributions.While it is narrow for PacBio HiFi reads, it is wide and skewed for ONT reads.
Inspired by medical laboratory tests which use cell count to measure the abundance, we continue with relative genome count measure in the rest of the paper.
In the next test, we evaluated the tools on databases with and without the human genome (containing only bacterial and archaeal genomes) for datasets with a higher The table shows the total abundance estimation error for mock community datasets for three definitions of relative abundance: (1) the percentage of reads classified to a species, (2) the percentage of genome lengths (sum of read lengths) classified to a species and (3) the percentage of genomes classified to a species.The total error was calculated as the L1 distance between specific types of abundances reported by tools and the abundances declared by vendors and summed up across all organisms (all present and all reported).In a row, cells with values close to the best are highlighted with a blue background, and those close the worst are marked in red.The best and worst values are further set apart with bold text and a more intense hue of their corresponding background colors.Unfortunately, due to its long-running time, results for MEGAN-N are unavailable for datasets PB_Zymo and PB_ATCC.It is important to note that Bracken produces only read counts assigned to a taxonomic rank.Therefore, to compare it with other tools, abundances-the percentage of genomes of species in the sample, were calculated by normalising read counts with the average genome length of the species to which corresponding read counts were assigned percentage of the human genome.The comparison of the abundance estimation error is given in Table 2.
Table 2 shows that when the human reads' percentage is lower, tools such as CLARK-S, Ram, Metamaps, Kaiju, and MEGAN-P were unaffected by the presence of the human genome in the database.However, as the human genome reads' percentage approaches 100%, the accuracy of most tools significantly declines.Exceptions were tools which used protein databases, and their abundance estimation slightly changed.In the case of the highly represented human genome, Metamaps achieved notably better results than other tools.However, its error almost quadrupled.Therefore it is important for each experiment to prepare a database which would include all organisms (including the host) that might be contained in the sample.
There was no clear winner when we included the human genome in databases.Metamaps performed best on ONT2, Ram on PB2, and CLARK-S on PB4.Centrifuge and protein-based tools were significantly worse than others.It is important to note that Bracken significantly improved Kraken2 results for PacBio datasets.
To analyse abundance error in more detail, we separately calculated the error for species present in the sample and for species not present in the sample but incorrectly reported by tools.We calculated the L1 distance between reported and expected abundance in percentages.Figure 3 and Additional file 1: Fig. S2 show the results.
Minimap2 align outperformed other tools in absolute differences between abundances of present organisms.In most of the datasets, its mean difference was below 2%, yet, other tools were not far away.Tools did not achieve good results on the PB_Zymo dataset.One of the reasons is the lack of two species in databases, Veillonella rogosae and Prevotella corporis, which represent more than 26% of the dataset.However, even if we did not consider these two species, distributions of errors for this dataset were much wider than in another PacBio mock community dataset, PB_ATCC (Additional file 1: Fig. S2).
Regarding species not present in the dataset, CLARK-S surpassed others, followed by MetaMaps, Ram and MEGAN-N.Minimap2 was more prone to reporting organisms The table shows the total abundance estimation error for datasets PB2, PB3 and ONT2 (datasets that contain human reads) for all tools and two databases: a database with the human genome and a database without the human genome.The error was calculated as the L1 distance between abundances calculated for each tool and true abundances.In a row, cells with values close to the best are highlighted with a blue background, and those close the worst are marked in red.The best and worst values are further set apart with bold text and a more intense hue of their corresponding background colors.Each dataset name is followed by the percentage of human reads in that dataset in parentheses not present in the sample.Therefore we deem there is space for improvement in the postprocessing analysis (e.g. using sequence similarity threshold or EM approach similar to MetaMaps) or by changing parameters such as kmer length or the percentage of filtered kmers.

Real data without ground truth
We also compared pipelines on six real gut microbiome datasets, for which the ground truth was unknown.We compared tools in two ways, with the results viewed as vectors.Firstly, a distance matrix between the tool results was calculated, and a hierarchical clustering algorithm was applied.Secondly, a PCA algorithm was performed on the tools' outputs, obtaining the two most significant components.The results of both analyses for datasets Sample10, Sample20, Sample21, SRR15489009, SRR15489011 and SRR15489017 are shown in Fig. 4. From Fig. 4, it can be seen that kmer-based tools such as Kraken2, Bracken and Centrifuge performed similarly.Clark was very close for most samples.While for PacBio HiFi datasets, mappers Minimap2 and Ram, and Metamaps performed similarly, for Fig. 4. 2D PCA and dendrograms for real datasets.Abundance estimation data for each tool is viewed as a vector, with components being abundance estimations for each organism.The data was transformed using PCA, and the two most significant components were plotted.Hierarchical clustering was also performed on initial vectors.The Figure displays 2D PCA plots and hierarchical clustering dendrograms for all real datasets.Data for MEGAN-N is unavailable on all datasets due to dataset sizes.Data for MEGAN-P is unavailable on datasets SRR15489009 and SRR15489017 ONT datasets Ram and Metamaps performed differently than Minimap2 tools.deSA-MBA was usually somewhere in between these kmer-based and mapping-based tools but usually closer to kmer-based tools.CLARK-S, Kaiju and MEGAN-P were far from other tools.
We analysed the ten most abundant species for each pipeline for the SRR15489009 (PacBio) dataset to better understand the differences between pipelines.Results are presented in Additional file 1: Table S11.Here, we saw a significant difference in content between protein and nucleotide databases.Although belonging to the core [32] of the human gut, Lachnospiraceae bacterium species are not found in the protein database.It is important to note that CLARK and CLARK-S did not recognise it, although they use the nucleotide database, which contains them.Protein database does not contain another prevalent [33] human gut bacteria-Eubacterium rectale.Furthermore, while mappers and all mapping-based tools recognised it (average abundance of 5%), none of the kmer-based tools did the same.Other important human gut species, such as Prevottela copri [34] and Ruminococcus torques [35], were not present in the NCBI-NT database.
A further interesting question for real gut datasets is the actual number of species present in the sample.This depends on the coverage.However, the total read length for PacBio HiFi and ONT datasets is similar, so we compared the reported results.Additional file 1: Table S3 shows the percentage of reads classified for all datasets and tools.The least number of classified reads were for three ONT gut samples, where best-performing tools rarely classified above 50% of reads.The numbers for PacBio samples were even above 80% for some tools.This might be because of the higher error rate and shorter read length of ONT data.Although there is a difference between the PacBio HiFi datasets (PB_Zymo and PB_ATCC), it is much smaller.The remaining 20% of unclassified reads might be explained by missing species in databases (Additional file 1: Table S10).
While Additional file 1: Table S4 shows the number of detected species for each real dataset and each tool, Table 4 shows a more detailed analysis of real gut samples.CLARK-S, Metamaps and Ram reported a significantly lower number of sequences for real gut datasets sequenced by PacBio HiFi technologies than all other tools.Since they performed better for real mock community datasets PB_Zymo and PB_ATCC, we hypothesise the number of species possible to detect for selected sequencing depth was closer to the numbers they reported and probably even smaller.

Computational resource usage
Results for running time and memory usage are presented in Table 5 (summary) and in Additional file 1: Table S2 (in detail).As expected, kmer-based tools, apart from CLARK-S and Bracken, surpassed others in the running time.For our synthetic test datasets, Centrifuge has the lowest running time for most datasets.Compared to mappers Minimap2 and especially Ram, the difference between the best kmer-based tools and mappers was below one order of magnitude.Among mapping-based tools, deSAMBA was only a few times slower than the fastest kmer best tools, while MetaMaps and MEGAN-N were among the slowest.Kaiju was among the fastest, and MEGAN-P was among the slowest tools.Ram used the least amount of memory.Kraken2, Centrifuge, Minimap2, MEGAN-P and MEGAN-N, for most datasets, used 2-3 times more memory.deSAMBA, CLARK, CLARK-S and MetaMaps used 10-15 times more memory.The table shows the number of detected species on real datasets for three different thresholds: 1, 10, and 50.A threshold represents the minimum number of reads assigned to a species to consider it present in the sample.Reads assigned to higher taxa are considered unclassified.The table also shows the percentage of unclassified reads for each dataset.Due to dataset sizes, results for MEGAN-N are unavailable.Results for MEGAN-P are unavailable on datasets SRR15489009 and SRR15489017 We additionally tested Ram and Minimap2 execution times by mapping only one sequence to the whole database file.The execution time for both was around 1000 s, suggesting that the database parsing and indexing took that much time.Both tools could improve execution time by storing and loading preprocessed database indexes to the disk.
For Bracken, we analysed the running time and memory consumption for the database building procedure because it needed to be executed for every dataset independently since datasets had different average read lengths, a parameter required by Bracken.The abundance estimation script executed almost instantaneously.
We analysed the scalability of used tools on several different dataset sizes.The results are presented in Table 6.Even for the largest datasets, Ram was still, at most, around 10 × slower than Kraken2, the fastest kmer-based tool.Although Centrifuge was the fastest tool when analysing execution times presented in Table 5, Kraken2 had the lowest execution times when tested on larger datasets.This happened because, for smaller datasets, index loading took up a large part of the execution time, and Centrifuge had the smallest database index.On larger datasets, where the actual sequence classification took a greater part of the execution time, Kraken2 outperformed Centrifuge.
All resource usage measurements were performed on a machine with sufficient disk space, 775 GB RAM and 256 virtual CPUs (2 × AMD EPYC 7662 64-Core Processor).Measurements were performed using 12 threads for 7 synthetic and 3 mock datasets and using 32 threads for 6 gut datasets.We cleared RAM Cache, File system buffer, and Swap space between runs.

Discussion
In this manuscript, we analysed four categories of tools for microbial classifications: kmer-based tools, mapping-based tools, general-purpose long-read mappers and tools which use protein databases.In most of the tests, tools which use protein databases performed worse than other categories.However, it would be interesting to additionally test them using RNAseq data, especially considering that long reads technologies can sequence the whole transcripts.
Mapping-based tools apart from deSAMBA were much slower.They all required a lot of memory while rarely performing better than other tools.General purpose mappers, The table shows running time (in seconds) and memory usage (in GB) for all tools and datasets.The table shows the minimum and maximum values for both measures for all synthetic and real datasets, which are about 1 Gbp in size.In a row, cells with values close to the best are highlighted with a blue background, and those close the worst are marked in red.The best and worst values are further set apart with bold text and a more intense hue of their corresponding background colors.
It is important to note that since Bracken relies on Kraken's output, the peak memory requirement is determined by the higher of the two tools' maximum values and the total execution time should be the sum of executions times especially Minimap2 in the alignment mode, achieved the best results in most tasks.Ram achieved good results, and its low memory consumption makes it suitable for analysis on laptops.However, mappers were slower than kmer-based tools.Considering this, kmer-based tools (Kraken2 and Centrifuge) will still be the tools of choice in most scenarios, especially in differential studies.An interesting aspect is the detection of present organisms.Most tools, especially those based on k-mers, overreported the number of present species.In their analysis, McIntyre et al. [37] found that different strategies, including abundance filtering, ensemble approach and tool intersection, help for short reads.Some tools, such as Kraken2, had an internal option for setting up the threshold for each read based on the number of find kmers.However, even these strategies could not completely eliminate false positives.A detailed assessment of various filtering and ensembling strategies is out of the scope of this paper, and it is a promising avenue for future research.Here, we show that setting up read thresholds reduces the number of false positives but can lead to a reduction in true positives.Moreover, longer reads help to increase accuracy.In parallel with our work, Portik et al. [10] report the same.Finally, testing on mock communities showed that a lower error rate of PacBio reads might reduce false positives.
A special case in this analysis was PB_Zymo dataset.Differences between declared and calculated abundances for this tool are much larger than for other tools The same we noticed for all tools.Reasons for these might be in the upstream procedures, including sample preparation, DNA extraction and sequencing.Tools were run on three mock community datasets (ONT_Zymo, PB_ATCC and PB_Zymo) subsampled to three different sizes to test their scalability.All datasets used in the main part of the paper were subsampled to 1Gbp (file size is about 2 GB) to make the testing viable even for slower tools.However, all three datasets were used in their original size, subsampled to half of the original dataset and subsampled to 1 Gbp.The table also shows the execution time for real gut microbiome datasets in their full size.Since Bracken, MEGAN-N, MEGAN-P and MetaMaps running times were prohibitively long on larger datasets, we omitted them from the analysis.Additionally, deSAMBA was unable to process some larger datasets completely.The table shows execution time in seconds Although we didn't know the ground truth for the real gut datasets, they helped us to understand the limitations and importance of carefully curated databases.For example, differences in calculated absolute abundances between tools were up to 50%, and all kmer-based tools didn't report one species often found in the gut microbiome.This shows that using just one pipeline for sensitive experiments is probably insufficient.In addition, despite protein-based tools performing worse in this benchmark, it would be interesting to use them in the case when large amounts of reads are not associated with taxa in the nucleic database.We have found there are differences in taxa between nucleic and protein databases, so protein-based tools might indicate the presence of some taxa not contained in the nucleic database.
Analysis conducted on samples containing species not found in the database and real gut samples highlights the significant advantages of using meticulously curated databases.To ensure maximum fairness in this study, we created databases for each tested tool using the same input sequences.Created databases contain only complete sequences.For many taxa, only partial information was available.However, even among complete sequences, many unknown nucleotides or amino acids exist.
In this work, we used a measure we call the relative genome count abundance, which considers the length of reads and genome size.Using read counts and calculating ratios of genomic DNA for each species or counting genomes or cells is an open question.Recently, Sun et al. [38] also warned of the importance of distinguishing these measures.It is important to note that the accuracy improved when we added read lengths to calculate the abundance for the ONT Zymo dataset.However, the results were slightly worse for mock communities sequenced by the PacBio HiFi technology.Investigating this in detail will be one of our next aims.
In the field of metagenomics, an ongoing challenge is how to effectively benchmark classification tools using metagenomic samples.In this study, we utilised real reads, mock communities, and varied ratios of present organisms in datasets to benchmark the performance of tools.Some researchers, such as McIntyre et al. [37], have used synthetic reads to create datasets that mimic the complexity of real samples.Although we also incorporated simulated datasets in our analysis, we acknowledge that these datasets may only partially replicate the complexity of real-world cases.However, most tools, including Minimap2, Ram, Kraken2, DeSamba, MEGAN-N, MEGAN_P, CLARK, CLARK-S, and Kaiju, analyse each read independently, and complex samples can be simulated by combining simple datasets linearly without significantly influencing the final results.Furthermore, we explored samples with logarithmic differences in species abundance.Our analysis revealed that different tools report varying species compositions for real gut datasets, and the true complexity of real samples remains incompletely understood.Therefore, we posit that a comprehensive study is needed to elucidate the differences between these two approaches to dataset preparation and their impact on the benchmark results.
This benchmark study shows that there is a discordance between tools, and we argue that there is room for improvement in all tools.A useful upgrade would be to provide some information about the confidence of whether the read belongs to a similar species or it doesn't belong to any species in the database.
Since the existing deep learning models, such as DeepMicrobes [39] and BERTax [40], have been rarely used and primarily focus on short reads, we did not include them in this benchmark.However, due to rapid progress in the usage of AI in all other fields and ample sequenced data, we believe that using deep learning methods, including those based on language models, will be one of the new avenues in this field.

Conclusion
The assessment on real gut datasets shows that tools report a different number of species and different abundances.Therefore we deem that classifying microbial organisms is still an open problem.
Preparing the proper datasets for evaluation presents notable challenges.We select reads from both isolated strains and sequenced mock communities.However, isolated strains cannot encompass the full spectrum of species variations, as not every species can be isolated.Mock communities, while useful, introduce their own biases, leading us to propose a future experiment: sequencing the same mock community multiple times using various sequencing technologies.This would provide insights into biases arising at different stages, including mock community preparation, genome extraction, library preparation, and the sequencing process itself.While read simulation offers an alternative, this method has its drawbacks, as it fails to account for the various biases related to sample preparation and they usually cannot perfectly reproduce the real sequencing results.
Despite the challenges we have in the evaluation we deem that due to their speed, kmer-based tools are still the first option for most analyses.However, if the time is not too critical or there is not enough RAM, general purpose mappers, including Mini-map2 and Ram, are a good alternative which provides higher accuracy in most scenarios.Modern mappers use fewer kmers to calculate mapping candidate positions, which makes them faster.We believe that with further improvement in long-read sequencing technology, most methods will move to detect smaller numbers of kmers in combination with chaining matches.They will probably need an additional postprocessing step using methods such as the EM algorithm or careful filtering to reduce the number of false positives.
In this manuscript, we use relative genome count as an abundance measure, and we focused on datasets constructed using real reads from previous experiments, mock communities or real gut samples.However, we are aware of the used abundance measure and benchmark dataset preparation, and their alternatives have limitations.Hence, we recognise the need for further research in defining appropriate abundance measures and improving metagenomics benchmark dataset construction techniques in the field.
Although there are just a few deep-learning methods for microbial classification, we expect a sharp increase in their number in the following months and years.
The results obtained from the analysis of real gut samples clearly indicate that using different strategies and database types may be necessary to accurately determine microbial content.Furthermore, our findings strongly support the notion that there is ample room for improvement in systematically incorporating key microbial species into existing databases containing reference genomes and protein sequences.We believe that advancements in long-read sequencing technologies will greatly facilitate and enhance this process.

Methods
This chapter describes the tools and assessment measures used and how we constructed the test datasets and databases.

Databases
Each classification tool comes with a prebuilt default database and instructions on building and using a custom database.We created a database for each tool based on the same set of organisms to remove bias related to default databases.We used NCBI-NT and NCBI-NR databases for creating protein and nucleotide databases used in this work We downloaded bacteria, archaea, and human sequences whose assembly_level property in the assembly_summary.txtfile (downloaded from the RefSeq directory) is "Complete Genome" or "Chromosome".In the following text, refer to the two databases that have been created as the protein and nucleotide databases.
The list of all the downloaded files and all the sequence ids of the databases can be found in the GitHub link: (https:// github.com/ lbcb-sci/ Metag enomi csBen chmark).
Genome sequences were downloaded on April 5th 2020, along with the taxonomy files nodes.dmpand names.dmp.This approach allowed the tools to be tested independently of the content of their default database.The details of how each database index was created for every tool are presented in Additional file 1: Materials S3.
The main part of the tests was performed on a database constructed for each tool from the same set of sequences: protein databases created from NCBI-NR database for tools using a protein database and nucleotide database created from NCBI-NT database for tools using a nucleotide database.We used all bacterial and archaeal genomes, plus the human genome.The nucleotide database contains 9044 tax ids, while the protein contains 11,435 tax ids.There are 552 tax ids in the protein database that are outside the nucleotide database, and there are 2943 tax ids that are in the protein database but not in the nucleotide database.It is important to highlight that only one species (Helcococcus kunzii-40,091) is present in mock datasets and the protein database but not in the nucleotide database.Additional file 1: Table S10 shows a list of dataset species not present in databases.The taxonomy data was downloaded from this website: https:// ftp.ncbi.nlm.nih.gov/ pub/ taxon omy/ acces sion2 taxid/ and genome sequences were downloaded from this website: https:// ftp.ncbi.nlm.nih.gov/ genom es/ refseq/.

Test datasets
To have realistic sequencing datasets while retaining control of our mock communities' exact content and building the ground truth, we constructed in silico datasets by mixing real reads from isolated, sequenced species.Data was downloaded from multiple sources (details in Additional file 1: Table S7), including the European Nucleotide Archive (ENA [41]) and the National Center for Biotechnology Information (NCBI [42]).This in-silico approach provides ground truth and great flexibility to create diverse datasets while offering real reads with their natural errors and length variance.Most datasets contained around 100,000 reads, allowing all tools to classify them within a week.We varied the proportion of species, some with even distributions, some with decreasing ratios with as little as five reads for one species (PB4 dataset).Seven test datasets were synthesised with the following composition: two ONT, four PacBio and one negative dataset containing PacBio and randomised reads.
All datasets that do not contain human reads were mapped to the human reference with minimap2 to check if there are human reads' contaminations.No sequences that belong to non-human species mapped to the human genome with a significant quality.
In addition to synthetic datasets, the tools were tested on three real datasets obtained by sequencing mock metagenomic communities.The results reported by the tested tools were used to calculate abundances and compared to standard specifications obtained from manufacturer pages.Standard, consisting of 16 bacteria and one yeast, with the expected abundance varying from 0.0001 to about 20% (download from NCBI archive, SRA run identifier: SRR13128014).However, for this dataset, the results obtained by all tools differed significantly from the specification.
Finally, the tools were tested on six real human gut microbiome datasets.Three datasets were produced as a part of the CaPES study at the Genome Institute of Singapore and published on ENA under the project ID PRJEB29152 (Bertrand et al. 2019).Datasets used for our test: -Sample10-11.9GB -Sample20-6.4GB -Sample21-14.8GB All three samples were sequenced using an R9.5 spot-on flowcell.Sample 10 was basecalled using Albacore, while samples 20 and 21 were basecalled using Guppy v0.3.0 for live 1D basecalling.
The other three datasets were generated as a part of a clinical trial by Siolta Therapeutics [6].All generated data is available in the NCBI database under the BioProject accession number PRJNA754443.All long-read datasets were sequenced using a PacBio Sequel II.Datasets used in our testing: -SRR15489009-12 GB -SRR15489011-11 GB -SRR15489017-14 GB
Tools start with the initial assignment of reads to genomes using in-advance-prepared databases of known organisms.Once when all reads are assigned, various methods are used to fine-tune the classification using information from assigned reads and taxonomy trees.The most popular postprocessing approaches are Expectation-Maximization (EM) estimation (MetaMaps, Centrifuge), Bayesian estimation (Bracken) and read assignment using the least common ancestor approach (MEGAN-LR, Kraken2).
The initial assignment of reads is based on aligning reads to a database of determined genomes.Aligning (Fig. 5) might be divided into three steps: (1) Searching for exact or approximate matches of short substrings of length k (kmers) or longer in a previously prepared index which contains a list of kmers from genomes, (2) Chaining kmer matches into a sequence, scoring the sequence, finding approximate positions of a read in a genome (mapping), and choosing the best genome candidates (3) Alignment of a read and candidate genomes using exact dynamic programming algorithm.While kmerbased tools use only the first step, mapping-based tools use the first and second or all three.Each additional step adds to accuracy but significantly increases the running time.
Usually, kmers are of a fixed size.The original approach used all sliding windows of size k in a sequence.This might lead to high accuracy, but it is too slow.Therefore, modern tools usually use just a few discriminative kmers per genome or choose a lexicographically smallest kmer in a window of w consecutive kmers, i.e. minimiser [45].
The outputs of various tools were processed to obtain read-level classifications and an abundance of various species in a sample.Short descriptions and versions of each tool are available in Additional file 1: Materials S1. Specific parameters and scripts used to run each tool are given in Additional file 1: Materials S2.
Furthermore, we could not successfully run our version of the MEGAN-N pipeline on the PB3 synthetic dataset and on PB_Zymo and PB_ATCC real datasets.In the case of the PB3 dataset, the mapping phase using the LAST aligner would go on for a week, and after that, the CPU and memory usage would drop down to almost zero, but the process would not be complete.Output produced in that way was corrupted and could not be used for testing.After three trials, we decided to drop the results.For PB_Zymo and PB_ ATCC datasets, the LAST aligner produced a huge MAF file with correct alignments, which we could not convert to an alignment out file (.DAA).This resulted in no classified reads.Huge MAF files and long running times were also a reason why we did not run MEGAN-N on larger datasets.MEGAN-P uses a protein database, produces significantly smaller MAF files, runs faster, uses less RAM and is able to complete most tests.MEGAN-P did not finish on the real gut microbiome datasets SRR15489009 and SRR15489017.Although MAF files produced by the LAST aligner seemed correct, the script used to convert MAF files into DAA files needed for further analysis produced a file with the data for only the first read.
Minimap2 and Ram are not intended for metagenomics classifications and often print several mapping results for a single sequence.Therefore we classify each sequence from the PAF output files using a harmonic mean: where the ML (mapping length) and NM (number of matches) are found in each row of the PAF file.For the SAM output files with alignment, the best classification for each sequence was determined by the highest alignment score.In the case of ties, the first alignment was taken as correct.This work aims not to maximise mappers' performance but to demonstrate their suitability for metagenomic classification.Detailed analysis of ties presented in Additional file 1: Table S8 shows that more careful handling of ties might lead to further improvement in read classifications.

Genome lengths
We acquired average genome lengths for species used in abundance calculation from the NCBI website: https://www.ncbi.nlm.nih.gov/genome/?term=<species_name> (e.g.https:// www.ncbi.nlm.nih.gov/ genom e/? term=e.coli for e.coli).We scraped the NCBI website for the "median total length (Mb)" string to get the genome length of the species.For species that were unsuccessfully scraped, we acquired the genome length from the same website manually.

Testing procedures
The tools' output was processed to obtain percentages of DNA reads and species' abundances in the sample.We evaluated the correctness of DNA read classification at species and genus level, i.e., only classifications that were assigned to a tax id which belongs to the species or lower level were used in the species-level analysis; and only classifications assigned to the genus or a lower level were used in the genus-level analysis.Outputs of the tools, which contained classification of reads to taxons, were processed.Taxonomic ids and ranks were extracted from the nodes.dmpfile downloaded from the NCBI website.

Read-level classification
To evaluate the quality of read-level classification, we first calculated four basic values: -True positives (TP): the number of reads that were classified to a correct species.
-False positives (FP): the number of reads that were classified as an incorrect species.

×
ML × NM ML + NM -True negatives (TN): the number of reads that remained unclassified and belonged to an organism not present in the database.-False negatives (FN): the number of reads that remained unclassified but belonged to an organism present in the database.
These four values were then used to calculate more complex and valuable evaluation metrics.The first metric used is classification accuracy-the percentage of reads that were correctly classified.
Global accuracy score was calculated for each read regardless of which organism it belonged to, only based on whether it was classified as TP, TN, FP or FN.As such, it did not reflect the proportion of each species in the sample.Incorrectly classified reads belonging to very abundant organisms will have the same effect as those belonging to very low abundance organisms.Because it is less biased towards larger classes, we used the F1 macro average score: we calculated the F1 score for each class (organism in the sample) separately and averaged them.We prefer this measure to the global F1 score, calculated across all classes.F1 macro average gives an equal impact on the score for each organism in the sample independently of its abundance.
Because the F1 score is zero for classes not in the database (as the number of true positives is zero), those classes were omitted from the calculation.

Abundance
Abundance represents the percentage of genomes of a specific taxon in the sample.Abundances calculated by benchmarked tools significantly differ due to differences in definitions and calculations.Here we calculated abundances for all tools using so-called relative genome count abundance measure.In addition to genome sizes in the calculation, we added read lengths due to the skewed and wide distribution of ONT reads.We calculated the abundance of a species as the sum of the lengths of assigned reads divided by its average genome length in the database.Obtained values were normalised in a manner that the total sum of abundances is 1.We calculated C i , a sequenced coverage of genome i, and A i , a relative abundance of genome i as where L i,j is the length of a read j assigned to genome i, G i length of genome i, and C k is the sequenced coverage that belongs to any of the reported genomes.We use L1 norm for measuring the accuracy of predicted abundances of taxa in a sample at rank r.The L1 norm is given by where x r and x * r are true and predicted abundances, respectively.We choose L1 instead of L2 because the former is less sensitive to outliers.
Genome lengths were obtained from the NCBI web page.Abundance estimations were assessed on species or lower-level classifications, considering all higher-level taxa classifications incorrect.

Organisms detection
Although some benchmarks, including [37], used precision and recall as measures for assessing organism detection, we decided to show only the number of true positives and the total number of reported organisms.This helped us to make a similar assessment for all datasets, including those for real gut datasets where we reported only the number of detected organisms.

Resource usage
Processor time and RAM usage were measured using a server with 775 GB RAM and 256 virtual CPUs (2 × AMD EPYC 7662 64-Core Processor).Tools were run using 12 threads for 7 synthetic and 3 mock datasets and 32 threads for 6 gut datasets.The measurements were taken using a fork of cgmemtime (https:// github.com/ isovic/ cgmem time) modified to write its output to a file.

Fig. 1
Fig. 1 Read level classification accuracy, comparison between species and genus level classification.Kmer-based are represented in red, mapping-based tools are represented in blue and protein tools are represented in green.Plot (a) shows species-level classification for which reads are considered correctly classified if classified to a correct species.Plot (b) shows genus-level classification for which reads are considered correctly classified if classified to a correct genus.Plot (c) shows mean values for both levels.Results for MEGAN-N are unavailable for the PB3 dataset

Fig. 2
Fig. 2 Comparison between classification accuracy and read length.The figure shows Q1, median and Q3 read length for true positive and false positive read classifications for each dataset.False positive read lengths are considered only for organisms in the database.The results shown in the figure were obtained using Minimap2 with alignment and Kraken

Fig. 3
Fig.3Abundance estimation error on species level-heatmap.gure 3. Abundance estimation error on species level-heatmap.Abundance estimation error was calculated by comparing the abundances calculated for each tool to the ground truth.Errors were calculated separately for species present in the sample (a) In-sample error L1) and for species erroneously reported by tools as present when they are absent from the sample (b) Out-of-sample error L1).The heatmap's cell color intensity directly reflects the magnitude of the estimation error, with red indicating higher errors and blue indicating lower errors.Notably, in the PB_Zymo dataset, species Veillonella rogosae (taxId: 423,477) and Prevotella corporis (taxId: 28,128) account for 19.94% and 6.26% of the composition, respectively, but are absent from the database."

-
ONT_Zymo: obtained by GridION sequencing of a Zymo Community Standard, consists of eight bacteria and two yeasts with the expected abundance varying from 0.37 to 21.6% (downloaded from Loman Labs https:// loman lab.github.io/ mockc ommun ity/).-PB_ATCC : obtained by PacBio HiFi sequencing of an ATCC MSA-1003 standard (20 Strain Staggered Mix Genomic Material), consists of 20 different bacterial species with the expected abundance varying from 0.02 to 18% (download from NCBI archive, SRA run identifier: SRR11606871).-PB_Zymo: obtained by PacBio HiFi sequencing of a Zymo D6331 Gut Microbiome

Fig. 5
Fig. 5 Read alignment.Read alignment consists of three steps (1) Indexing and kmer search, (2) Chaining and scoring (3) Alignment.Kmer-based tools use only the first step, and usually, they do not care about the position in the genome.Mapping-based tools use the first and second steps, which increase accuracy but last much longer.The alignment step provides the exact alignment and the score but increases the running time ACC =TP + TN TP + FP + TN + FNPrecision(PR) = TP TP + FP Recall (RC) = TP TP + FN F 1 = 2 × PR × RC PR + RC C i = j L i,j G i

Table 1
Comparison of the total abundance estimation error between relative read count, relative genome length and relative genome count abundances

Table 2
Comparing abundance estimation error for the database with the human genome and the database without the human genome

Table 3
True positive and false positive organism detection

Table 4
The number of detected species for real gut microbiome datasets

Table 5
Resource usage

Table 6
Execution time for datasets of different sizes